Crawling Deep Web Content through Query Forms

Authors

Jun Liu

Zhaohui Wu

Lu Jiang

Qinghua Zheng

Xiao Liu

Abstract

This paper proposes the concept of the Minimum Executable Pattern (MEP), and then presents an MEP generation method and an MEP-based adaptive query method for the Deep Web. The query method extends the query interface from a single text box to a set of MEPs, and generates a locally optimal query by choosing an MEP and a keyword vector for that MEP. Our method mitigates, to a certain extent, the problem of “data islands” that results from the limitations of current methods. Experimental results on six real-world Deep Web sites show that our method outperforms existing methods in terms of query capability and applicability.
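The selection step described in the abstract (choosing one pattern and one keyword vector per round) can be illustrated with a minimal greedy sketch. This is not the authors' algorithm: the encoding of a pattern as a tuple of form-field names, the function name pick_next_query, and the gain estimator are all assumptions made for illustration, and the gain values are invented.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

# Hypothetical encoding (not from the paper): a pattern is the tuple of form
# fields that are filled together, and a query pairs a pattern with one
# keyword per field.
Pattern = Tuple[str, ...]
Query = Tuple[Pattern, Tuple[str, ...]]


def pick_next_query(
    candidates: Iterable[Query],
    estimate_gain: Callable[[Query], float],
    issued: Set[Query],
) -> Optional[Query]:
    """Return the not-yet-issued candidate with the highest estimated gain.

    The choice is greedy, so it is only locally optimal for the current round.
    Returns None when every candidate has already been issued.
    """
    best: Optional[Query] = None
    best_gain = float("-inf")
    for query in candidates:
        if query in issued:
            continue  # never resubmit a query
        gain = estimate_gain(query)
        if gain > best_gain:
            best, best_gain = query, gain
    return best


if __name__ == "__main__":
    candidates = [
        (("title",), ("web",)),
        (("author",), ("liu",)),
        (("title", "year"), ("web", "2009")),
    ]
    # Invented gain estimates, standing in for whatever statistic a crawler
    # would maintain about expected new records per query.
    made_up_gain = {candidates[0]: 40.0, candidates[1]: 25.0, candidates[2]: 12.0}
    issued = {candidates[0]}
    print(pick_next_query(candidates, made_up_gain.__getitem__, issued))
```

In a real crawler the estimator would be updated after each response, which is what would make the query method adaptive; the sketch leaves that part out.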

Similar resources

A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases

The Web has been rapidly “deepened” by massive databases online: Recent surveys show that while the surface Web has linked billions of static HTML pages, a far more significant amount of information is “hidden” in the deep Web, behind the query forms of searchable databases. With its myriad databases and hidden content, this deep Web is an important frontier for information search. In this pape...

Learning to Surface Deep Web Content

We propose a novel deep web crawling framework based on reinforcement learning. The crawler is regarded as an agent and deep web database as the environment. The agent perceives its current state and submits a selected action (query) to the environment according to Q-value. Based on the framework we develop an adaptive crawling method. Experimental results show that it outperforms the state of ...

Efficient Deep Web Crawling Using Reinforcement Learning

Deep web refers to the hidden part of the Web that remains unavailable to standard Web crawlers. Obtaining Deep Web content is challenging and has been acknowledged as a significant gap in the coverage of search engines. To this end, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and deep web database as t...

A Novel Approach to Integrated Search Information Retrieval Technique for Hidden Web for Domain Specific Crawling

The traditional web crawlers retrieve contents from only the “Surface web” and are unable to crawl through the hidden portion of the Web containing high quality information which is dynamically generated through querying databases when the queries are submitted through a search interface. For Hidden web, most of the published research has been done to identify/detect such searchable forms and m...

A Task-specific Approach for Crawling the Deep Web

There is a great amount of valuable information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Deep Web or the Hidden Web. Most probably, the information of highest value contained in the deep web is that behind web forms. In this paper, we describe a prototype hidden-web crawler able to access such content. Our approach is b...

Publication date: 2009
